Loan Data Exploration

Preliminary Wrangling

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
df = pd.read_csv('prosperLoanData.csv')
In [4]:
print(df.describe())
print("*"*60)
print(df.info())
       ListingNumber           Term    BorrowerAPR   BorrowerRate  \
count   1.139370e+05  113937.000000  113912.000000  113937.000000   
mean    6.278857e+05      40.830248       0.218828       0.192764   
std     3.280762e+05      10.436212       0.080364       0.074818   
min     4.000000e+00      12.000000       0.006530       0.000000   
25%     4.009190e+05      36.000000       0.156290       0.134000   
50%     6.005540e+05      36.000000       0.209760       0.184000   
75%     8.926340e+05      36.000000       0.283810       0.250000   
max     1.255725e+06      60.000000       0.512290       0.497500   

         LenderYield  EstimatedEffectiveYield  EstimatedLoss  EstimatedReturn  \
count  113937.000000             84853.000000   84853.000000     84853.000000   
mean        0.182701                 0.168661       0.080306         0.096068   
std         0.074516                 0.068467       0.046764         0.030403   
min        -0.010000                -0.182700       0.004900        -0.182700   
25%         0.124200                 0.115670       0.042400         0.074080   
50%         0.173000                 0.161500       0.072400         0.091700   
75%         0.240000                 0.224300       0.112000         0.116600   
max         0.492500                 0.319900       0.366000         0.283700   

       ProsperRating (numeric)  ProsperScore  ...  LP_ServiceFees  \
count             84853.000000  84853.000000  ...   113937.000000   
mean                  4.072243      5.950067  ...      -54.725641   
std                   1.673227      2.376501  ...       60.675425   
min                   1.000000      1.000000  ...     -664.870000   
25%                   3.000000      4.000000  ...      -73.180000   
50%                   4.000000      6.000000  ...      -34.440000   
75%                   5.000000      8.000000  ...      -13.920000   
max                   7.000000     11.000000  ...       32.060000   

       LP_CollectionFees  LP_GrossPrincipalLoss  LP_NetPrincipalLoss  \
count      113937.000000          113937.000000        113937.000000   
mean          -14.242698             700.446342           681.420499   
std           109.232758            2388.513831          2357.167068   
min         -9274.750000             -94.200000          -954.550000   
25%             0.000000               0.000000             0.000000   
50%             0.000000               0.000000             0.000000   
75%             0.000000               0.000000             0.000000   
max             0.000000           25000.000000         25000.000000   

       LP_NonPrincipalRecoverypayments  PercentFunded  Recommendations  \
count                    113937.000000  113937.000000    113937.000000   
mean                         25.142686       0.998584         0.048027   
std                         275.657937       0.017919         0.332353   
min                           0.000000       0.700000         0.000000   
25%                           0.000000       1.000000         0.000000   
50%                           0.000000       1.000000         0.000000   
75%                           0.000000       1.000000         0.000000   
max                       21117.900000       1.012500        39.000000   

       InvestmentFromFriendsCount  InvestmentFromFriendsAmount      Investors  
count               113937.000000                113937.000000  113937.000000  
mean                     0.023460                    16.550751      80.475228  
std                      0.232412                   294.545422     103.239020  
min                      0.000000                     0.000000       1.000000  
25%                      0.000000                     0.000000       2.000000  
50%                      0.000000                     0.000000      44.000000  
75%                      0.000000                     0.000000     115.000000  
max                     33.000000                 25000.000000    1189.000000  

[8 rows x 61 columns]
************************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
ListingKey                             113937 non-null object
ListingNumber                          113937 non-null int64
ListingCreationDate                    113937 non-null object
CreditGrade                            28953 non-null object
Term                                   113937 non-null int64
LoanStatus                             113937 non-null object
ClosedDate                             55089 non-null object
BorrowerAPR                            113912 non-null float64
BorrowerRate                           113937 non-null float64
LenderYield                            113937 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha)                  84853 non-null object
ProsperScore                           84853 non-null float64
ListingCategory (numeric)              113937 non-null int64
BorrowerState                          108422 non-null object
Occupation                             110349 non-null object
EmploymentStatus                       111682 non-null object
EmploymentStatusDuration               106312 non-null float64
IsBorrowerHomeowner                    113937 non-null bool
CurrentlyInGroup                       113937 non-null bool
GroupKey                               13341 non-null object
DateCreditPulled                       113937 non-null object
CreditScoreRangeLower                  113346 non-null float64
CreditScoreRangeUpper                  113346 non-null float64
FirstRecordedCreditLine                113240 non-null object
CurrentCreditLines                     106333 non-null float64
OpenCreditLines                        106333 non-null float64
TotalCreditLinespast7years             113240 non-null float64
OpenRevolvingAccounts                  113937 non-null int64
OpenRevolvingMonthlyPayment            113937 non-null float64
InquiriesLast6Months                   113240 non-null float64
TotalInquiries                         112778 non-null float64
CurrentDelinquencies                   113240 non-null float64
AmountDelinquent                       106315 non-null float64
DelinquenciesLast7Years                112947 non-null float64
PublicRecordsLast10Years               113240 non-null float64
PublicRecordsLast12Months              106333 non-null float64
RevolvingCreditBalance                 106333 non-null float64
BankcardUtilization                    106333 non-null float64
AvailableBankcardCredit                106393 non-null float64
TotalTrades                            106393 non-null float64
TradesNeverDelinquent (percentage)     106393 non-null float64
TradesOpenedLast6Months                106393 non-null float64
DebtToIncomeRatio                      105383 non-null float64
IncomeRange                            113937 non-null object
IncomeVerifiable                       113937 non-null bool
StatedMonthlyIncome                    113937 non-null float64
LoanKey                                113937 non-null object
TotalProsperLoans                      22085 non-null float64
TotalProsperPaymentsBilled             22085 non-null float64
OnTimeProsperPayments                  22085 non-null float64
ProsperPaymentsLessThanOneMonthLate    22085 non-null float64
ProsperPaymentsOneMonthPlusLate        22085 non-null float64
ProsperPrincipalBorrowed               22085 non-null float64
ProsperPrincipalOutstanding            22085 non-null float64
ScorexChangeAtTimeOfListing            18928 non-null float64
LoanCurrentDaysDelinquent              113937 non-null int64
LoanFirstDefaultedCycleNumber          16952 non-null float64
LoanMonthsSinceOrigination             113937 non-null int64
LoanNumber                             113937 non-null int64
LoanOriginalAmount                     113937 non-null int64
LoanOriginationDate                    113937 non-null object
LoanOriginationQuarter                 113937 non-null object
MemberKey                              113937 non-null object
MonthlyLoanPayment                     113937 non-null float64
LP_CustomerPayments                    113937 non-null float64
LP_CustomerPrincipalPayments           113937 non-null float64
LP_InterestandFees                     113937 non-null float64
LP_ServiceFees                         113937 non-null float64
LP_CollectionFees                      113937 non-null float64
LP_GrossPrincipalLoss                  113937 non-null float64
LP_NetPrincipalLoss                    113937 non-null float64
LP_NonPrincipalRecoverypayments        113937 non-null float64
PercentFunded                          113937 non-null float64
Recommendations                        113937 non-null int64
InvestmentFromFriendsCount             113937 non-null int64
InvestmentFromFriendsAmount            113937 non-null float64
Investors                              113937 non-null int64
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
None
In [5]:
#Nice stats to plot vis a vis occupation or state
df.DebtToIncomeRatio.unique()
Out[5]:
array([0.17   , 0.18   , 0.06   , ..., 0.3108 , 0.07419, 0.23284])

Observations:

  1. Mean Loan term is approx 40 years. 75% borrowers have taken loan for 36 years.
  2. 75% Borrowers have taken loan on 25% BorrowerRate
In [6]:
df.Term.unique()
Out[6]:
array([36, 60, 12], dtype=int64)
In [7]:
#snaity check
df.head()
Out[7]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

BroserRate Distribution Plot

  1. The plot is created to see distribution of BrowerRate in the Data Set
  2. The plot shows concentration of entries till about 20% Rate.
In [8]:
#Univariate Plot
sns.distplot(df['BorrowerRate'])
plt.title('BorrowerRate Distribution')
Out[8]:
Text(0.5, 1.0, 'BorrowerRate Distribution')

Findings Loan Status, Loan Original Amount Box Plot

  1. Maximum value loan looks to be in current loan category
  2. There are more loans below 10000 in current loan category
  3. In dues categories there are maximum number of loans due in >120 days category
In [9]:
#Bi-variate box plot
plt.figure(figsize=[35,15])
base_color = sns.color_palette()[0]
ax = sns.boxplot(data=df, x='LoanStatus', y='LoanOriginalAmount', color=base_color)
ax.tick_params(labelsize=13)
ax.set_xlabel(xlabel='LoanStatus', fontsize=20)
ax.set_ylabel(ylabel='LoanOriginalAmount', fontsize=20)
plt.title('Loan Status Box LoanOriginalAmount Distribution Box Plot', fontsize=30)
Out[9]:
Text(0.5, 1.0, 'Loan Status Box LoanOriginalAmount Distribution Box Plot')
In [10]:
# checking total listingkeys
df.ListingKey.count()
Out[10]:
113937
In [11]:
# checking unique values including ListingKey uniques
df.nunique();
In [12]:
# Checking how does the duplicate rows look like, also how may rows have duplicates?
df[df.ListingKey.duplicated(keep=False)].sort_values("ListingKey")
Out[12]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
32680 00223594917038064A7C947 998257 2013-11-15 16:58:37.167000000 NaN 36 Current NaN 0.23540 0.1980 0.1880 ... -12.34 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
32681 00223594917038064A7C947 998257 2013-11-15 16:58:37.167000000 NaN 36 Current NaN 0.23540 0.1980 0.1880 ... -12.34 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
32964 00473590513960687DD308F 941296 2013-10-07 15:47:36.023000000 NaN 60 Current NaN 0.22242 0.1980 0.1880 ... -73.12 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
17274 00473590513960687DD308F 941296 2013-10-07 15:47:36.023000000 NaN 60 Current NaN 0.22242 0.1980 0.1880 ... -73.12 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
7478 0098360461900952056DB93 1190614 2014-03-02 14:21:39.583000000 NaN 36 Current NaN 0.29660 0.2574 0.2474 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
33220 0098360461900952056DB93 1190614 2014-03-02 14:21:39.583000000 NaN 36 Current NaN 0.29660 0.2574 0.2474 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
27677 01163604029146842E28D9C 1233732 2014-02-25 14:33:39.830000000 NaN 36 Current NaN 0.31975 0.2800 0.2700 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
788 01163604029146842E28D9C 1233732 2014-02-25 14:33:39.830000000 NaN 36 Current NaN 0.31975 0.2800 0.2700 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
27360 014F35910923350802E1B29 930618 2013-09-26 16:44:24.163000000 NaN 60 Current NaN 0.19323 0.1695 0.1595 ... -27.50 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
18324 014F35910923350802E1B29 930618 2013-09-26 16:44:24.163000000 NaN 60 Current NaN 0.19323 0.1695 0.1595 ... -27.50 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
18849 018A360063948152589C8BE 1136886 2014-01-14 07:56:29.517000000 NaN 60 Completed 2014-02-24 00:00:00 0.28875 0.2624 0.2524 ... -1.66 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
33959 018A360063948152589C8BE 1136886 2014-01-14 07:56:29.517000000 NaN 60 Completed 2014-02-24 00:00:00 0.28875 0.2624 0.2524 ... -1.66 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
7555 021F3593082999771F5E621 970570 2013-11-01 13:53:22 NaN 36 Current NaN 0.12691 0.0990 0.0890 ... -16.78 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
1296 021F3593082999771F5E621 970570 2013-11-01 13:53:22 NaN 36 Current NaN 0.12691 0.0990 0.0890 ... -16.78 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
19256 02AD3598720043943A56991 1097279 2013-12-26 13:26:46.030000000 NaN 36 Current NaN 0.23375 0.1960 0.1860 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 57
9460 02AD3598720043943A56991 1097279 2013-12-26 13:26:46.030000000 NaN 36 Current NaN 0.23375 0.1960 0.1860 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 57
36511 04053588643952074304508 902157 2013-09-13 22:32:56.560000000 NaN 36 Current NaN 0.27285 0.2346 0.2246 ... -15.92 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
21839 04053588643952074304508 902157 2013-09-13 22:32:56.560000000 NaN 36 Current NaN 0.27285 0.2346 0.2246 ... -15.92 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
11889 04A53597159795718CE3A26 1063352 2013-12-09 18:54:29.140000000 NaN 36 Current NaN 0.15223 0.1239 0.1139 ... -7.43 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
11888 04A53597159795718CE3A26 1063352 2013-12-09 18:54:29.140000000 NaN 36 Current NaN 0.15223 0.1239 0.1139 ... -7.43 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
66386 04C13599434217079754AEE 1102061 2013-12-29 18:39:22.483000000 NaN 36 Current NaN 0.09469 0.0809 0.0709 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
11400 04C13599434217079754AEE 1102061 2013-12-29 18:39:22.483000000 NaN 36 Current NaN 0.09469 0.0809 0.0709 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
21845 04C13599434217079754AEE 1102061 2013-12-29 18:39:22.483000000 NaN 36 Current NaN 0.09469 0.0809 0.0709 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
11117 05043603291539396A9E43E 1184734 2014-02-26 09:15:42.277000000 NaN 60 Current NaN 0.19553 0.1715 0.1615 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 49
22452 05043603291539396A9E43E 1184734 2014-02-26 09:15:42.277000000 NaN 60 Current NaN 0.19553 0.1715 0.1615 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 49
12956 059135904162159071EA963 925803 2013-09-26 06:08:44.510000000 NaN 36 Current NaN 0.23540 0.1980 0.1880 ... -19.74 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
3422 059135904162159071EA963 925803 2013-09-26 06:08:44.510000000 NaN 36 Current NaN 0.23540 0.1980 0.1880 ... -19.74 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
3424 05973602960011656FA5D4E 1188772 2014-03-01 17:24:57.483000000 NaN 36 Current NaN 0.13734 0.1089 0.0989 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
67648 05973602960011656FA5D4E 1188772 2014-03-01 17:24:57.483000000 NaN 36 Current NaN 0.13734 0.1089 0.0989 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
4270 05E135934278258503421EB 989195 2013-11-02 14:11:49.110000000 NaN 36 Current NaN 0.32446 0.2850 0.2750 ... -6.36 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
47709 FC453601437434724612543 1154037 2014-01-22 05:36:40.333000000 NaN 36 Current NaN 0.30172 0.2624 0.2524 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
53552 FC5F3592110384244FC7BA7 961598 2013-10-17 19:00:53.250000000 NaN 60 Current NaN 0.20321 0.1790 0.1690 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
27266 FC5F3592110384244FC7BA7 961598 2013-10-17 19:00:53.250000000 NaN 60 Current NaN 0.20321 0.1790 0.1690 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
33975 FC5F3592110384244FC7BA7 961598 2013-10-17 19:00:53.250000000 NaN 60 Current NaN 0.20321 0.1790 0.1690 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
54843 FC8F359832109202000D82C 1119803 2014-01-08 14:09:14.177000000 NaN 60 Current NaN 0.13227 0.1099 0.0999 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
53655 FC8F359832109202000D82C 1119803 2014-01-08 14:09:14.177000000 NaN 60 Current NaN 0.13227 0.1099 0.0999 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
52210 FCF635931956615326179DF 959753 2013-10-17 12:01:59.453000000 NaN 36 Current NaN 0.12691 0.0990 0.0890 ... -36.86 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
64316 FCF635931956615326179DF 959753 2013-10-17 12:01:59.453000000 NaN 36 Current NaN 0.12691 0.0990 0.0890 ... -36.86 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
72331 FD143593949672534EF3594 977573 2013-10-28 23:04:25.740000000 NaN 36 Current NaN 0.24205 0.2045 0.1945 ... -37.03 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
53669 FD143593949672534EF3594 977573 2013-10-28 23:04:25.740000000 NaN 36 Current NaN 0.24205 0.2045 0.1945 ... -37.03 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
18636 FD4F359960143939223E9C0 1078459 2014-01-03 09:32:03.783000000 NaN 36 Current NaN 0.28636 0.2474 0.2374 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
76889 FD4F359960143939223E9C0 1078459 2014-01-03 09:32:03.783000000 NaN 36 Current NaN 0.28636 0.2474 0.2374 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
68778 FD6335903636172174A3837 899188 2013-09-16 14:35:55.227000000 NaN 60 Current NaN 0.17839 0.1550 0.1450 ... -47.11 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
68779 FD6335903636172174A3837 899188 2013-09-16 14:35:55.227000000 NaN 60 Current NaN 0.17839 0.1550 0.1450 ... -47.11 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
59758 FD8535964303574380588AC 1046027 2013-12-03 10:57:46.427000000 NaN 36 Current NaN 0.31668 0.2774 0.2674 ... -8.42 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
36573 FD8535964303574380588AC 1046027 2013-12-03 10:57:46.427000000 NaN 36 Current NaN 0.31668 0.2774 0.2674 ... -8.42 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
48517 FD9D359492836040450E61D 1060452 2013-11-27 14:20:29.743000000 NaN 60 Current NaN 0.13227 0.1099 0.0999 ... -25.32 0.0 0.0 0.0 0.0 1.0 0 0 0.0 214
69553 FD9D359492836040450E61D 1060452 2013-11-27 14:20:29.743000000 NaN 60 Current NaN 0.13227 0.1099 0.0999 ... -25.32 0.0 0.0 0.0 0.0 1.0 0 0 0.0 214
55130 FDCB3592503114504DA2C65 981333 2013-10-15 08:57:16.180000000 NaN 36 Current NaN 0.19501 0.1585 0.1485 ... -51.56 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
53968 FDCB3592503114504DA2C65 981333 2013-10-15 08:57:16.180000000 NaN 36 Current NaN 0.19501 0.1585 0.1485 ... -51.56 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
93034 FE7436021001476584CACCD 1191338 2014-02-13 11:49:22.950000000 NaN 36 Current NaN 0.22812 0.1905 0.1805 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 67
32846 FE7436021001476584CACCD 1191338 2014-02-13 11:49:22.950000000 NaN 36 Current NaN 0.22812 0.1905 0.1805 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 67
73529 FED53589115301546EB2BEB 907826 2013-09-14 13:17:16.497000000 NaN 36 Current NaN 0.15324 0.1249 0.1149 ... -9.85 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
79541 FED53589115301546EB2BEB 907826 2013-09-14 13:17:16.497000000 NaN 36 Current NaN 0.15324 0.1249 0.1149 ... -9.85 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
80321 FEF9358974260956441C450 906461 2013-09-14 09:32:49.973000000 NaN 36 Current NaN 0.21699 0.1800 0.1700 ... -8.64 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
48533 FEF9358974260956441C450 906461 2013-09-14 09:32:49.973000000 NaN 36 Current NaN 0.21699 0.1800 0.1700 ... -8.64 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
73622 FF1B3590569175597E6830C 935199 2013-09-28 08:14:30.190000000 NaN 36 Current NaN 0.15833 0.1299 0.1199 ... -8.14 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
73623 FF1B3590569175597E6830C 935199 2013-09-28 08:14:30.190000000 NaN 36 Current NaN 0.15833 0.1299 0.1199 ... -8.14 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
57066 FFE13600149501913291BF7 1122473 2014-01-09 10:59:42.003000000 NaN 36 Current NaN 0.29455 0.2554 0.2454 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
58306 FFE13600149501913291BF7 1122473 2014-01-09 10:59:42.003000000 NaN 36 Current NaN 0.29455 0.2554 0.2454 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

1698 rows × 81 columns

In [13]:
#seems there are completely(all columns) duplicate rows in the data set for example:
df[df['ListingKey'] == '0F563597161095613517437']
Out[13]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
29 0F563597161095613517437 1051243 2013-12-17 09:18:33.220000000 NaN 36 Current NaN 0.15223 0.1239 0.1139 ... -29.73 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
104257 0F563597161095613517437 1051243 2013-12-17 09:18:33.220000000 NaN 36 Current NaN 0.15223 0.1239 0.1139 ... -29.73 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

2 rows × 81 columns

In [14]:
#There are 1698 such duplicate rows, which we can drop
duplicate_df = df[df.ListingKey.duplicated(keep=False)].sort_values("ListingKey").shape
duplicate_df
Out[14]:
(1698, 81)
In [15]:
#dropping the duplicates
df.drop_duplicates(subset='ListingKey', keep='first', inplace=True)
In [16]:
#checking what all columns we have in the data set 
df.columns
Out[16]:
Index(['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade',
       'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate',
       'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss',
       'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)',
       'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState',
       'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey',
       'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
       'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines',
       'TotalCreditLinespast7years', 'OpenRevolvingAccounts',
       'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries',
       'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years',
       'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
       'RevolvingCreditBalance', 'BankcardUtilization',
       'AvailableBankcardCredit', 'TotalTrades',
       'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months',
       'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable',
       'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans',
       'TotalProsperPaymentsBilled', 'OnTimeProsperPayments',
       'ProsperPaymentsLessThanOneMonthLate',
       'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed',
       'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing',
       'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber',
       'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount',
       'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
       'MonthlyLoanPayment', 'LP_CustomerPayments',
       'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees',
       'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss',
       'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations',
       'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',
       'Investors'],
      dtype='object')

BroserRate LenderYield Scatter Plot

  1. The plot is created to see correlation between quantitative variables BorrowerRate and LenderYield
  2. As expected, the variatbles show a strong correlation.
In [17]:
sns.set(font_scale=1)
base_color = sns.color_palette()[0]
sns.scatterplot(x='BorrowerRate', y='LenderYield', data=df, color=base_color)
plt.title('BorrowerRate and LenderYield Correlation Scatter Plot', fontsize=15)
Out[17]:
Text(0.5, 1.0, 'BorrowerRate and LenderYield Correlation Scatter Plot')
In [18]:
# Checking the strong correlation as indicated by above scatter plot
df['BorrowerRate'].corr(df['LenderYield'])
Out[18]:
0.9992073582842514

Occupation Borrowers count Bar chart

  1. The plot is created to see which are the top borrowing professions
  2. Computer Programmers, Analyst and Teachers are among top 10 borrowers
In [19]:
# Which are the top borrowing professions
plt.figure(figsize=[20, 5])
df_profession_loan = df.groupby(['Occupation'])['LoanOriginalAmount'].sum().sort_values(ascending=False).reset_index(name='borrowed_sum')
base_color = sns.color_palette()[0]
ax=sns.barplot(x='Occupation', y='borrowed_sum', data=df_profession_loan.head(10), color=base_color)
plt.title('Occupation vise Borrowed Sum Bar Plot', fontsize=15)
ax.set(ylabel='Total Borrowed Amount')
Out[19]:
[Text(0, 0.5, 'Total Borrowed Amount')]

Occupation Borrowers Violin plot

  1. The plot is created to see how loan amount is distributed within a paricular occupation
  2. Teachers and Administrative Assistants seem to have more lower values loans
In [20]:
#Violin plot for top ten borrowing professions
sns.set(font_scale=3)
plt.figure(figsize=[50, 25])
base_color = sns.color_palette()[0]
sns.violinplot(x='Occupation' , y='LoanOriginalAmount', data=df[df['Occupation'].isin(['Other', 'Professional', 'Executive', 'Computer Programmer','Analyst', 'Sales - Commission', 'Accountant/CPA','Teacher','Nurse (RN)', 'Administrative Assistant'])], color=base_color)
plt.title('Occupation vise LoanOriginalAmount Violin Plot', fontsize=40)
Out[20]:
Text(0.5, 1.0, 'Occupation vise LoanOriginalAmount Violin Plot')

Correlation Heatmap

  1. The plot is created to see correlation among different columns in the Data Set

Findings

  1. Following group of columns have good postive correlation with each other: TotalCreditLinespast7years, CurrentCreditLines, OpenCreditLines, OpenRevolvingAccounts, OpenRevolvingMonthlyPayment

  2. Following group of columns have good postive correlation with each other: BorrowerAPR, BorrowerRate, LenderYield, EstimatedEffectiveYield, EstimatedLoss, EstimatedReturn

  3. TotalTrades is having good postive correlation with TotalCreditLinespast7years, CurrentCreditLines, OpenCreditLines, OpenRevolvingAccounts

In [21]:
# Multi variate plot
sns.set(font_scale=10)
select_var = ['TotalCreditLinespast7years', 'CurrentCreditLines', 'OpenCreditLines', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment',  'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn' ]
df_corr = df[select_var].copy()
plt.figure(figsize=[250, 250])
plt.yticks(fontsize=100)
ax=sns.heatmap(df_corr.corr(), vmin=0, vmax=0.5, cmap='coolwarm', square=True, annot=True, fmt='.2f', linewidth=.25)
ax.tick_params(labelbottom='on',labeltop='on')
plt.xticks(rotation=90)
plt.xticks(fontsize=100)
plt.title('Correlation Heat Map of all Columns', fontsize=300)
Out[21]:
Text(0.5, 1, 'Correlation Heat Map of all Columns')
In [22]:
#Are there any data type issues in the data set?
print(df.shape)
print(df.dtypes)
# ListingCreationDate, ClosedDate,  should be date time object
(113066, 81)
ListingKey                              object
ListingNumber                            int64
ListingCreationDate                     object
CreditGrade                             object
Term                                     int64
LoanStatus                              object
ClosedDate                              object
BorrowerAPR                            float64
BorrowerRate                           float64
LenderYield                            float64
EstimatedEffectiveYield                float64
EstimatedLoss                          float64
EstimatedReturn                        float64
ProsperRating (numeric)                float64
ProsperRating (Alpha)                   object
ProsperScore                           float64
ListingCategory (numeric)                int64
BorrowerState                           object
Occupation                              object
EmploymentStatus                        object
EmploymentStatusDuration               float64
IsBorrowerHomeowner                       bool
CurrentlyInGroup                          bool
GroupKey                                object
DateCreditPulled                        object
CreditScoreRangeLower                  float64
CreditScoreRangeUpper                  float64
FirstRecordedCreditLine                 object
CurrentCreditLines                     float64
OpenCreditLines                        float64
                                        ...   
TotalProsperLoans                      float64
TotalProsperPaymentsBilled             float64
OnTimeProsperPayments                  float64
ProsperPaymentsLessThanOneMonthLate    float64
ProsperPaymentsOneMonthPlusLate        float64
ProsperPrincipalBorrowed               float64
ProsperPrincipalOutstanding            float64
ScorexChangeAtTimeOfListing            float64
LoanCurrentDaysDelinquent                int64
LoanFirstDefaultedCycleNumber          float64
LoanMonthsSinceOrigination               int64
LoanNumber                               int64
LoanOriginalAmount                       int64
LoanOriginationDate                     object
LoanOriginationQuarter                  object
MemberKey                               object
MonthlyLoanPayment                     float64
LP_CustomerPayments                    float64
LP_CustomerPrincipalPayments           float64
LP_InterestandFees                     float64
LP_ServiceFees                         float64
LP_CollectionFees                      float64
LP_GrossPrincipalLoss                  float64
LP_NetPrincipalLoss                    float64
LP_NonPrincipalRecoverypayments        float64
PercentFunded                          float64
Recommendations                          int64
InvestmentFromFriendsCount               int64
InvestmentFromFriendsAmount            float64
Investors                                int64
Length: 81, dtype: object
In [23]:
# What are all the Alpha unique ratings in the data set
df['ProsperRating (Alpha)'].unique()
Out[23]:
array([nan, 'A', 'D', 'B', 'E', 'C', 'AA', 'HR'], dtype=object)

Prosperity Rating Count Bar Plot

  1. The plot is created to see how proper ratings are distributed in the Data Set
  2. C is the most common Prosper Rating
In [30]:
#prosper rating plot; checking how may entries below to certain rating?
sns.set(font_scale=1)
df_alfa_rating = df.groupby(['ProsperRating (Alpha)']).size().reset_index(name='prosper_rating_counts')
ax=df_alfa_rating.plot.bar(x='ProsperRating (Alpha)')
plt.title('Prosperity Rating Count Bar Plot', fontsize=15)
ax.set(xlabel='Rating(Alpha)', ylabel='Number of Borrowers')
Out[30]:
[Text(0, 0.5, 'Number of Borrowers'), Text(0.5, 0, 'Rating(Alpha)')]

Borrowers count state vise Bar plot

  1. The plot is created to see which states have more number of borrowers
  2. CA seems to have very high number of borrowers in comparison to other states
In [35]:
#Where the borrowers are from, checking using pandas groupby and bar plot?
df_state = df.groupby(['BorrowerState']).size().sort_values(ascending=False).reset_index(name='state_abbr_name')
ax=df_state.plot.bar(x='BorrowerState', figsize=(10,4))
plt.title('Borrowers Count State vise Bar Plot', fontsize=15)
ax.set(ylabel='Number of Borrowers')
Out[35]:
[Text(0, 0.5, 'Number of Borrowers')]

Borrowers count occupationtype vise Bar plot

  1. The plot is created to see which occupation types have more number of borrowers
  2. Seems highest number of borrowers in the data set are Employed
In [36]:
# More borrowers belong to which Employment status?
df.EmploymentStatus.unique()
df_employ_status = df.groupby(['EmploymentStatus']).size().sort_values(ascending=False).reset_index(name='employment_count')
ax=df_employ_status.plot.bar(x='EmploymentStatus', figsize=(10,4))
plt.title('Borrowers from different Employment Status', fontsize=15)
ax.set(ylabel='Number of Borrowers')
Out[36]:
[Text(0, 0.5, 'Number of Borrowers')]

House owners vs Non house owners borrowers count Pie chart

  1. The plot is created to see houseoweners or non house owners have more number of borrowers
  2. Seems home owner borrowes are almost same as non home owner borrowers
In [49]:
# How many borrowers are homeowners and how many are not home owners?
df.IsBorrowerHomeowner.unique()
df_homeowner = df.groupby(['IsBorrowerHomeowner']).size().reset_index(name='Number of Borrowers')
df_homeowner.rename(index={0:'Non Home Owners', 1:'Home Owners'}, inplace=True)
ax=df_homeowner.plot.pie(y='Number of Borrowers', figsize=(5,5))
ax.legend(loc='upper left')
plt.title('House Owner vs. Non House Owner(Borrowers Count) Pie Chart', fontsize=15)
Out[49]:
Text(0.5, 1.0, 'House Owner vs. Non House Owner(Borrowers Count) Pie Chart')

House owners vs Non house owners borrowered amount Pie chart

  1. The plot is created to see houseoweners or non house owners have borrowed more amount
  2. Seems, non home oweners have borrowed more than home owners.
In [48]:
# Does the amount of loan vary significantly between home owners and non home owners?
df_homeowner_borrowed_sum = df.groupby(['IsBorrowerHomeowner'])['LoanOriginalAmount'].sum().sort_values(ascending=False).reset_index(name='Borrowed Sum Classification')
df_homeowner_borrowed_sum.rename(index={0:'Non Home Owners', 1:'Home Owners'}, inplace=True)
df_homeowner_borrowed_sum.plot.pie(y='Borrowed Sum Classification', figsize=(5,5))
plt.title('House Owner vs. Non House Owner(Borrowers Amount)', fontsize=15)
Out[48]:
Text(0.5, 1.0, 'House Owner vs. Non House Owner(Borrowers Amount)')

Borrowered amount state wise bar plot

  1. The plot is created to see which states have borrowed higher amount
  2. CA seems to have not just very high number of borrowers but also borrowed amount is also highest in comparison to other states
In [50]:
# Which states borrowed more money?
df_borrowed_sum = df.groupby(['BorrowerState'])['LoanOriginalAmount'].sum().sort_values(ascending=False).reset_index(name='borrowed_sum')
ax=df_borrowed_sum.plot.bar(x='BorrowerState', figsize=(10,4))
ax.set(ylabel='Number of Borrowers')
plt.title('State vise sum of LoanOriginalAmount', fontsize=15)
Out[50]:
Text(0.5, 1.0, 'State vise sum of LoanOriginalAmount')
In [ ]: